fix(openai): self-heal stale Codex used% snapshots + lock semantics (#2994) #3039
Merged
Wei-Shaw merged 1 commit on Jun 5, 2026
…ei-Shaw#2994)

The OpenAI/Codex 5h "used %" inversion that caused fresh accounts to show ~96-99% used (PR Wei-Shaw#2918, commit b65dde6) was already reverted in Wei-Shaw#2993, so the stored value is now the correct "used %" again. This commit hardens that fix:

1. Regression test locking in direct "used %" semantics. The semantics have flip-flopped twice (Wei-Shaw#2918 -> Wei-Shaw#2993) with no value-level guard: a fresh account (secondary_used_percent=1, 5h window) must store codex_5h_used_percent=1, not 99.
2. Stale-bounded self-heal in resolveOpenAIQuotaUtilization (the single auto-pause chokepoint). An account poisoned with an inflated used% gets excluded from scheduling, and a paused account never receives traffic to refresh its snapshot, so it stayed stuck until the window's reset_at passed (up to 5h/7d). When codex_usage_updated_at is older than 2h, the account is no longer auto-paused on that snapshot; it gets one request whose response headers refresh the snapshot and self-heal it. A missing timestamp is treated as fresh (stays paused), and an actively-served exhausted account refreshes the timestamp on every response so it never crosses the bound; it cannot escape auto-pause.

No change to Normalize(); no 100-x reintroduced; no new dependency wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
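The "used %" semantics that the regression test pins down can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the repository; the only facts assumed from the commit message are that the header already reports a used percentage and that the regressed code applied `100 - x`.

```go
package main

import "fmt"

// storeUsedPercentDirect keeps the header value as-is. The header already
// reports "used %", so a fresh account at 1% used is stored as 1.
// Hypothetical name, for illustration only.
func storeUsedPercentDirect(headerUsedPercent float64) float64 {
	return headerUsedPercent
}

// storeUsedPercentInverted reproduces the regressed behaviour (100 - x).
// It is shown only to make the bug visible; it is not the intended logic.
func storeUsedPercentInverted(headerUsedPercent float64) float64 {
	return 100 - headerUsedPercent
}

func main() {
	const freshAccountUsed = 1.0 // secondary_used_percent=1 on a fresh account

	fmt.Println("direct:  ", storeUsedPercentDirect(freshAccountUsed))   // 1
	fmt.Println("inverted:", storeUsedPercentInverted(freshAccountUsed)) // 99
}
```

A value-level assertion on the stored number (1, not 99) is what distinguishes the two variants; a test that only checks that some value was stored would pass under both.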
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds regression guards and a “stale snapshot” escape hatch to prevent OpenAI Codex account auto-pause from permanently excluding accounts when usage snapshots are incorrect or no longer refreshed (issue #2994).
Changes:
- Add a regression test ensuring 5h/7d `used_percent` semantics are not inverted.
- Introduce a staleness threshold (`openAICodexAutoPauseStaleAfter`) so stale usage snapshots no longer keep accounts auto-paused.
- Add scheduler tests validating stale snapshots allow one request to self-heal while fresh exhausted snapshots still pause.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| backend/internal/service/openai_gateway_service_codex_snapshot_test.go | Adds a regression test ensuring codex_5h_used_percent / codex_7d_used_percent are stored as “used%” (not inverted). |
| backend/internal/service/openai_gateway_service.go | Adds staleness-based bypass for auto-pause and helper to detect stale snapshots via codex_usage_updated_at. |
| backend/internal/service/openai_account_scheduler_test.go | Adds tests covering stale-snapshot bypass and ensuring fresh exhausted snapshots remain paused. |
Comment on lines +931 to +933

    "codex_5h_reset_at": time.Now().Add(time.Hour).Format(time.RFC3339),
    // Snapshot is stale: older than openAICodexAutoPauseStaleAfter (2h).
    "codex_usage_updated_at": time.Now().Add(-3 * time.Hour).Format(time.RFC3339),
Comment on lines +1514 to +1517

    updatedAt, err := parseTime(fmt.Sprint(updatedRaw))
    if err != nil {
        return false
    }
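The fragment above returns `false` when the timestamp cannot be parsed, which means "not stale", so the account stays paused. A minimal sketch of a helper with that fail-closed behaviour follows. It assumes the timestamp is stored as an RFC 3339 string under `codex_usage_updated_at` (as in the test fixture above) and uses the standard `time.Parse` in place of the project's `parseTime`, whose exact behaviour is not shown here. `snapshotIsStale`, `staleAfter`, and `mustParse` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the 2h bound described for openAICodexAutoPauseStaleAfter.
const staleAfter = 2 * time.Hour

// mustParse is a hypothetical helper for building example timestamps.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// snapshotIsStale reports whether codex_usage_updated_at is older than
// staleAfter. A missing or unparseable timestamp yields false, so such a
// snapshot is treated as fresh and cannot bypass auto-pause.
func snapshotIsStale(extra map[string]any, now time.Time) bool {
	raw, ok := extra["codex_usage_updated_at"]
	if !ok || raw == nil {
		return false
	}
	// Assumption: RFC 3339 strings; the real code calls its own parseTime.
	updatedAt, err := time.Parse(time.RFC3339, fmt.Sprint(raw))
	if err != nil {
		return false
	}
	return now.Sub(updatedAt) > staleAfter
}

func main() {
	now := mustParse("2026-06-05T12:00:00Z")
	stale := map[string]any{"codex_usage_updated_at": now.Add(-3 * time.Hour).Format(time.RFC3339)}
	fresh := map[string]any{"codex_usage_updated_at": now.Add(-30 * time.Minute).Format(time.RFC3339)}

	fmt.Println(snapshotIsStale(stale, now))            // true
	fmt.Println(snapshotIsStale(fresh, now))            // false
	fmt.Println(snapshotIsStale(map[string]any{}, now)) // false (missing timestamp)
}
```

Returning `false` on every error path is the conservative choice: a malformed snapshot keeps the existing pause decision instead of silently lifting it.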
Problem (#2994)
Newly imported OpenAI/Codex OAuth accounts showed ~96–99% used in the 5h window even when nearly unused (a few cents). The inflated value also tripped `shouldAutoPauseOpenAIAccountByQuota`, so the account was excluded from scheduling ("which causes subsequent requests to be unable to be scheduled to this account").

Root cause was a `100 - usedPercent` inversion in `Normalize()` (commit `b65dde63`, PR #2918). For a fresh account whose `x-codex-secondary-used-percent` ≈ 1, it stored `100 - 1 = 99` into `codex_5h_used_percent`. That inversion was already reverted in `main` (PR #2993). The stored value is now the correct "used %".

What this PR adds (hardening, not a re-fix)

1. Regression test locking in the direct "used %" semantics. They have flip-flopped twice (fix(usage): correct the inverted used%/remaining% in the OpenAI 5h usage window #2918 → fix(usage): revert OpenAI 5h used_percent inversion (#2918 regression) #2993) with no value-level guard: a fresh account (`secondary_used_percent=1`, 5h window) must store `codex_5h_used_percent=1`, not 99.
2. Stale-bounded self-heal in `resolveOpenAIQuotaUtilization`, the single auto-pause chokepoint feeding the scheduler, WS forwarder, and gateway. An account already poisoned with an inflated used% gets excluded from scheduling, and a paused account never receives traffic to refresh its snapshot, so before this change it only recovered when the window's `reset_at` passed (≤5h/≤7d) or an admin opened its usage page (no background refresher exists). Now, when `codex_usage_updated_at` is older than 2h, the account is no longer auto-paused on that snapshot; it gets one request whose response headers refresh the snapshot via the existing `UpdateCodexUsageSnapshotFromHeaders` path and self-heal it.

Safety

- A missing `codex_usage_updated_at` is treated as fresh (account stays paused), so no timestamp-less snapshot can silently escape auto-pause. Real poisoned snapshots always carry the timestamp.
- An actively-served exhausted account refreshes `codex_usage_updated_at` on every response, so it never crosses the 2h bound and stays paused. Guarded by a dedicated test.

No change to `Normalize()`; no `100-x` reintroduced; no new dependency wiring or mocks.

Tests

- `TestBuildCodexUsageExtraUpdates_FreshAccountUsedPercentNotInverted_Issue2994`
- `TestOpenAIGatewayService_SelectAccountForModelWithExclusions_StaleUsageSnapshotSkipsPause_Issue2994`
- `TestOpenAIGatewayService_SelectAccountForModelWithExclusions_FreshExhaustedSnapshotStillPauses_Issue2994` (guardrail)

`go test -tags=unit ./internal/service/...` and `golangci-lint run ./internal/service/...` pass.

Fixes #2994
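The self-heal lifecycle described in this PR can be sketched as a small simulation: a poisoned snapshot pauses the account, the pause lifts once the snapshot is older than 2h, and a single response then overwrites the snapshot with the real value. This is an illustration under assumptions, not the repository's code: the `snapshot` type, `pausedByQuota`, `refreshFromResponse`, and the `pauseThreshold` of 95 are all hypothetical, and the real decision lives in `resolveOpenAIQuotaUtilization`.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter     = 2 * time.Hour // the 2h bound described in this PR
	pauseThreshold = 95.0          // hypothetical auto-pause threshold
)

// t0 is an arbitrary reference instant for the timeline below.
var t0 = time.Date(2026, time.June, 5, 12, 0, 0, 0, time.UTC)

// at returns the instant the given number of hours after t0.
func at(hours int) time.Time { return t0.Add(time.Duration(hours) * time.Hour) }

// snapshot is a simplified stand-in for the stored Codex usage fields.
type snapshot struct {
	UsedPercent float64
	UpdatedAt   time.Time // zero value models a missing timestamp
}

// pausedByQuota applies the stale bound before the threshold check.
// A missing timestamp is treated as fresh, so it never lifts the pause.
func pausedByQuota(s snapshot, now time.Time) bool {
	if !s.UpdatedAt.IsZero() && now.Sub(s.UpdatedAt) > staleAfter {
		return false // stale: let one request through to refresh the snapshot
	}
	return s.UsedPercent >= pauseThreshold
}

// refreshFromResponse models response headers overwriting the snapshot.
func refreshFromResponse(realUsedPercent float64, now time.Time) snapshot {
	return snapshot{UsedPercent: realUsedPercent, UpdatedAt: now}
}

func main() {
	poisoned := snapshot{UsedPercent: 99, UpdatedAt: at(0)}

	fmt.Println(pausedByQuota(poisoned, at(1))) // true: snapshot is still fresh
	fmt.Println(pausedByQuota(poisoned, at(3))) // false: older than 2h, one request allowed

	healed := refreshFromResponse(1, at(3))
	fmt.Println(pausedByQuota(healed, at(3))) // false: real usage is 1%
}
```

In this model a genuinely exhausted account that is refreshed with a high value and a new timestamp is paused again straight away, which is the behaviour the guardrail test is described as protecting.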
🤖 Generated with Claude Code